Computation and Language 64
☆ CoGen: Learning from Feedback with Coupled Comprehension and Generation
Systems with both language comprehension and generation capabilities can
benefit from the tight connection between the two. This work studies coupling
comprehension and generation with a focus on continually learning from
interaction with users. We propose techniques to tightly integrate the two
capabilities for both learning and inference. We situate our studies in
two-player reference games, and deploy various models for thousands of
interactions with human users, while learning from interaction feedback
signals. We show dramatic improvements in performance over time, with
comprehension-generation coupling leading to performance improvements up to 26%
in absolute terms and up to 17% higher accuracies compared to a non-coupled
system. Our analysis also shows coupling has substantial qualitative impact on
the system's language, making it significantly more human-like.
comment: 17 pages, 9 figures
☆ BattleAgentBench: A Benchmark for Evaluating Cooperation and Competition Capabilities of Language Models in Multi-Agent Systems
Large Language Models (LLMs) are becoming increasingly powerful and capable
of handling complex tasks, e.g., building single agents and multi-agent
systems. Compared to single agents, multi-agent systems have higher
requirements for the collaboration capabilities of language models. Many
benchmarks are proposed to evaluate their collaborative abilities. However,
these benchmarks lack fine-grained evaluations of LLM collaborative
capabilities. Additionally, multi-agent collaborative and competitive scenarios
are ignored in existing works. To address these two problems, we propose a
benchmark, called BattleAgentBench, which defines seven sub-stages across
three difficulty levels and conducts a fine-grained evaluation of language
models in terms of single-agent scenario navigation capabilities, paired-agent
task execution abilities, and multi-agent collaboration and competition
capabilities. We conducted extensive evaluations on four leading closed-source
models and seven open-source models. Experimental results indicate that
API-based models perform excellently on simple tasks, whereas small
open-source models struggle even with simple tasks. Regarding difficult tasks that require
collaborative and competitive abilities, although API-based models have
demonstrated some collaborative capabilities, there is still enormous room for
improvement.
☆ More Text, Less Point: Towards 3D Data-Efficient Point-Language Understanding
Enabling Large Language Models (LLMs) to comprehend the 3D physical world
remains a significant challenge. Due to the lack of large-scale 3D-text pair
datasets, the success of LLMs has yet to be replicated in 3D understanding. In
this paper, we rethink this issue and propose a new task: 3D Data-Efficient
Point-Language Understanding. The goal is to enable LLMs to achieve robust 3D
object understanding with minimal 3D point cloud and text data pairs. To
address this task, we introduce GreenPLM, which leverages more text data to
compensate for the lack of 3D data. First, inspired by using CLIP to align
images and text, we utilize a pre-trained point cloud-text encoder to map the
3D point cloud space to the text space. This mapping allows us to seamlessly
connect the text space with LLMs. Once the point-text-LLM connection is
established, we further enhance text-LLM alignment by expanding the
intermediate text space, thereby reducing the reliance on 3D point cloud data.
Specifically, we generate 6M free-text descriptions of 3D objects, and design a
three-stage training strategy to help LLMs better explore the intrinsic
connections between different modalities. To achieve efficient modality
alignment, we design a zero-parameter cross-attention module for token pooling.
Extensive experimental results show that GreenPLM requires only 12% of the 3D
training data used by existing state-of-the-art models to achieve superior 3D
understanding. Remarkably, GreenPLM also achieves competitive performance using
text-only data. The code and weights are available at:
https://github.com/TangYuan96/GreenPLM.
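The abstract does not specify how the zero-parameter cross-attention module for token pooling works; a minimal numpy sketch of one parameter-free variant (mean token as query, tokens as keys and values) could look like the following. The design is an assumption for illustration, not GreenPLM's actual module:

```python
import numpy as np

def zero_param_attention_pool(tokens: np.ndarray) -> np.ndarray:
    """Pool a (n_tokens, d) feature matrix into a single (d,) vector using
    cross-attention with no learned parameters: the query is the mean token,
    and the tokens themselves serve as keys and values."""
    d = tokens.shape[1]
    query = tokens.mean(axis=0)              # (d,) parameter-free query
    scores = tokens @ query / np.sqrt(d)     # (n_tokens,) scaled dot products
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax over tokens
    return weights @ tokens                  # (d,) attention-weighted pool

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 32))            # 16 point-cloud tokens, dim 32
pooled = zero_param_attention_pool(feats)
print(pooled.shape)                          # (32,)
```

Because nothing here is learned, the module adds no parameters while still weighting informative tokens more heavily than a plain mean would.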
☆ Leveraging Open Knowledge for Advancing Task Expertise in Large Language Models
Yuncheng Yang, Yulei Qin, Tong Wu, Zihan Xu, Gang Li, Pengcheng Guo, Hang Shao, Yucheng Shi, Ke Li, Xing Sun, Jie Yang, Yun Gu
The cultivation of expertise for large language models (LLMs) to solve tasks
of specific areas often requires special-purpose tuning with calibrated
behaviors on the expected stable outputs. To avoid the huge cost of manually
preparing instruction datasets and training resources, which can amount to
hundreds of hours, the exploitation of open knowledge, including a wealth of
low-rank adaptation (LoRA) models and instruction datasets, serves as a good
starting point. However, existing methods for model and data selection focus on the
performance of general-purpose capabilities while neglecting the knowledge gap
exposed in domain-specific deployment. In the present study, we propose to
bridge this gap by introducing a few human-annotated samples (i.e., K-shot data)
for advancing the task expertise of LLMs with open knowledge. Specifically, we develop
an efficient and scalable pipeline to cost-efficiently produce task experts
where K-shot data intervene in selecting the most promising expert candidates
and the task-relevant instructions. A mixture-of-experts (MoE) system is built
to make the best use of the individual yet complementary knowledge of multiple
experts. We identify two keys to the success of an MoE system: 1) adherence to
the K-shot data, and 2) insistence on diversity. For the former, we ensure that
models that truly possess problem-solving abilities on the K-shot data are
selected rather than blind guessers. Besides, during data selection, instructions
that share task-relevant contexts with K-shot are prioritized. For the latter,
we highlight the diversity of constituting experts and that of the fine-tuning
instructions throughout the model and data selection process. Extensive
experimental results confirm the superiority of our approach over existing
methods on utilization of open knowledge across various tasks. Codes and models
will be released later.
comment: 28 pages, 12 tables, 10 figures
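The expert-selection criterion above (prefer experts that genuinely solve the K-shot samples over blind guessers) can be sketched as ranking candidate experts by their K-shot accuracy. All names and the toy task below are illustrative, not from the paper:

```python
# Hypothetical sketch: rank candidate (LoRA) experts by accuracy on K-shot data.
# `experts` maps an expert name to a prediction function; everything here is
# a stand-in for real model candidates.
def select_experts(experts, k_shot, top_n=2):
    def accuracy(predict):
        return sum(predict(x) == y for x, y in k_shot) / len(k_shot)
    ranked = sorted(experts, key=lambda name: accuracy(experts[name]), reverse=True)
    return ranked[:top_n]

k_shot = [(1, 1), (2, 4), (3, 9)]                  # toy task: squaring
experts = {
    "squarer": lambda x: x * x,                    # truly solves the task
    "doubler": lambda x: 2 * x,                    # right only by accident at x=2
    "guesser": lambda x: 0,                        # blind guesser
}
print(select_experts(experts, k_shot, top_n=2))    # ['squarer', 'doubler']
```

Ranking on held-out K-shot performance is what filters out candidates that merely guess rather than solve the task.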
☆ LLM-Based Multi-Hop Question Answering with Knowledge Graph Integration in Evolving Environments
Ruirui Chen, Weifeng Jiang, Chengwei Qin, Ishaan Singh Rawal, Cheston Tan, Dongkyu Choi, Bo Xiong, Bo Ai
The rapid obsolescence of information in Large Language Models (LLMs) has
driven the development of various techniques to incorporate new facts. However,
existing methods for knowledge editing still face difficulties with multi-hop
questions that require accurate fact identification and sequential logical
reasoning, particularly among numerous fact updates. To tackle these
challenges, this paper introduces Graph Memory-based Editing for Large Language
Models (GMeLLo), a straightforward and effective method that merges the explicit
knowledge representation of Knowledge Graphs (KGs) with the linguistic
flexibility of LLMs. Beyond merely leveraging LLMs for question answering,
GMeLLo employs these models to convert free-form language into structured
queries and fact triples, facilitating seamless interaction with KGs for rapid
updates and precise multi-hop reasoning. Our results show that GMeLLo
significantly surpasses current state-of-the-art knowledge editing methods in
the multi-hop question answering benchmark, MQuAKE, especially in scenarios
with extensive knowledge edits.
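The KG side of this pipeline (fact edits stored as triples, multi-hop questions answered by following a chain of relations) can be sketched with a toy triple store. The facts and relation chains below are illustrative, not MQuAKE data:

```python
# Toy triple store: a knowledge edit is a triple update, and a multi-hop
# question becomes a chain of relations to follow from a start entity.
kg = {("UK", "head_of_government"): "Rishi Sunak",
      ("Rishi Sunak", "party"): "Conservative"}

def edit(subj, rel, obj):
    kg[(subj, rel)] = obj                    # apply a fact update

def multi_hop(entity, relations):
    for rel in relations:                    # follow each hop in turn
        entity = kg[(entity, rel)]
    return entity

# "Which party does the head of government of the UK belong to?"
print(multi_hop("UK", ["head_of_government", "party"]))   # Conservative

edit("UK", "head_of_government", "Keir Starmer")          # two fact edits
edit("Keir Starmer", "party", "Labour")
print(multi_hop("UK", ["head_of_government", "party"]))   # Labour
```

Because the edit lands directly in the structured store, the multi-hop answer updates immediately, which is the advantage over editing weights or prompts.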
☆ Nexus: Specialization meets Adaptability for Efficiently Training Mixture of Experts
Efficiency, specialization, and adaptability to new data distributions are
qualities that are hard to combine in current Large Language Models. The
Mixture of Experts (MoE) architecture has been the focus of significant
research because its inherent conditional computation enables such desirable
properties. In this work, we focus on "upcycling" dense expert models into an
MoE, aiming to improve specialization while also adding the ability to adapt to
new tasks easily. We introduce Nexus, an enhanced MoE architecture with
adaptive routing where the model learns to project expert embeddings from
domain representations. This approach allows Nexus to flexibly add new experts
after the initial upcycling through separately trained dense models, without
requiring large-scale MoE training for unseen data domains. Our experiments
show that Nexus achieves a relative gain of up to 2.1% over the baseline for
initial upcycling, and an 18.8% relative gain for extending the MoE with a new
expert by using limited finetuning data. This flexibility of Nexus is crucial
to enable an open-source ecosystem where every user continuously assembles
their own MoE-mix according to their needs.
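The adaptive-routing idea above (expert embeddings projected from domain representations, so a new expert can be attached without retraining the router) might be sketched as follows; the shared projection and random features are stand-ins for components that would be learned in practice:

```python
import numpy as np

rng = np.random.default_rng(42)
d, k = 8, 3                                    # hidden dim, initial expert count
W = rng.normal(size=(d, d)) / np.sqrt(d)       # shared projection (learned in practice)

# Each expert's routing embedding is projected from its domain representation,
# so adding an expert only requires projecting a new domain's features.
domain_reprs = rng.normal(size=(k, d))         # e.g., mean features per data domain
expert_embs = domain_reprs @ W

def route(token):
    return int(np.argmax(expert_embs @ token)) # send token to highest-scoring expert

new_domain = rng.normal(size=(d,))             # attach a 4th expert, no MoE retraining
expert_embs = np.vstack([expert_embs, new_domain @ W])
token = rng.normal(size=(d,))
print(0 <= route(token) < 4)                   # True: router now covers 4 experts
```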
☆ A New Method for Cross-Lingual-based Semantic Role Labeling
Semantic role labeling is a crucial task in natural language processing,
enabling better comprehension of natural language. However, the lack of
annotated data in multiple languages has posed a challenge for researchers. To
address this, a deep learning algorithm based on model transfer has been
proposed. The algorithm utilizes a dataset consisting of the English portion of
CoNLL2009 and a corpus of semantic roles in Persian. To optimize training
efficiency, only ten percent of the training data from each language is used.
The results of the proposed model demonstrate significant improvements
compared to Niksirt et al.'s model. In monolingual mode, the proposed model
achieved a 2.05 percent improvement on F1-score, while in cross-lingual mode,
the improvement was even more substantial, reaching 6.23 percent. Worth noting
is that the compared model only trained two of the four stages of semantic role
labeling and employed golden data for the remaining two stages. This suggests
that the actual superiority of the proposed model surpasses the reported
numbers by a significant margin. The development of cross-lingual methods for
semantic role labeling holds promise, particularly in addressing the scarcity
of annotated data for various languages. These advancements pave the way for
further research in understanding and processing natural language across
different linguistic contexts.
☆ Bias in LLMs as Annotators: The Effect of Party Cues on Labelling Decision by Large Language Models
Human coders are biased. We test similar biases in Large Language Models
(LLMs) as annotators. By replicating an experiment run by Ennser-Jedenastik and
Meyer (2018), we find evidence that LLMs use political information, and
specifically party cues, to judge political statements. Not only do LLMs use
relevant information to contextualize whether a statement is positive,
negative, or neutral based on the party cue, they also reflect the biases of
the human-generated data upon which they have been trained. We also find that
unlike humans, who are only biased when faced with statements from extreme
parties, LLMs exhibit significant bias even when prompted with statements from
center-left and center-right parties. The implications of our findings are
discussed in the conclusion.
☆ Persuasion Games using Large Language Models
Large Language Models (LLMs) have emerged as formidable instruments capable
of comprehending and producing human-like text. This paper explores the
potential of LLMs to shape human perspectives and subsequently influence their
decisions on particular tasks. This capability finds applications in diverse
domains such as investment, credit cards, insurance, and retail, wherein LLMs
assist users in selecting appropriate products (e.g., insurance policies,
investment plans, and credit cards), as well as in Behavioral Change Support Systems (BCSS).
We present a sophisticated multi-agent framework wherein a consortium of
agents operates in a collaborative manner. The primary agent engages directly with
users through persuasive dialogue, while the auxiliary agents perform tasks
such as information retrieval, response analysis, development of persuasion
strategies, and validation of facts. Empirical evidence from our experiments
demonstrates that this collaborative methodology significantly enhances the
persuasive efficacy of the LLM. We analyze user resistance to persuasive
efforts continuously and counteract it by employing a combination of rule-based
and LLM-based resistance-persuasion mapping techniques.
We employ simulated personas and generate conversations in insurance,
banking, and retail domains to evaluate the proficiency of large language
models (LLMs) in recognizing, adjusting to, and influencing various personality
types. Concurrently, we examine the resistance mechanisms employed by LLM
simulated personas. Persuasion is quantified via measurable surveys before and
after interaction, LLM-generated scores on conversation, and user decisions
(purchase or non-purchase).
☆ Knowledge Navigator: LLM-guided Browsing Framework for Exploratory Search in Scientific Literature
The exponential growth of scientific literature necessitates advanced tools
for effective knowledge exploration. We present Knowledge Navigator, a system
designed to enhance exploratory search abilities by organizing and structuring
the retrieved documents from broad topical queries into a navigable, two-level
hierarchy of named and descriptive scientific topics and subtopics. This
structured organization provides an overall view of the research themes in a
domain, while also enabling iterative search and deeper knowledge discovery
within specific subtopics by allowing users to refine their focus and retrieve
additional relevant documents. Knowledge Navigator combines LLM capabilities
with cluster-based methods to enable an effective browsing method. We
demonstrate our approach's effectiveness through automatic and manual
evaluations on two novel benchmarks, CLUSTREC-COVID and SCITOC. Our code,
prompts, and benchmarks are made publicly available.
☆ Automatic Differential Diagnosis using Transformer-Based Multi-Label Sequence Classification
As the field of artificial intelligence progresses, assistive technologies
are becoming more widely used across all industries. The healthcare industry is
no different, with numerous studies being done to develop assistive tools for
healthcare professionals. Automatic diagnostic systems are one such beneficial
tool that can assist with a variety of tasks, including collecting patient
information, analyzing test results, and diagnosing patients. However, the idea
of developing systems that can provide a differential diagnosis has been
largely overlooked in most of these research studies. In this study, we propose
a transformer-based approach for providing differential diagnoses based on a
patient's age, sex, medical history, and symptoms. We use the DDXPlus dataset,
which provides differential diagnosis information for patients based on 49
disease types. First, we propose a method to process the tabular patient data
from the dataset and engineer it into patient reports suitable for our
research. In addition, we introduce two data modification modules to
diversify the training data and consequently improve the robustness of the
models. We approach the task as a multi-label classification problem and
conduct extensive experiments using four transformer models. All the models
displayed promising results by achieving over 97% F1 score on the held-out test
set. Moreover, we design additional behavioral tests to get a broader
understanding of the models. In particular, for one of our test cases, we
prepared a custom test set of 100 samples with the assistance of a doctor. The
results on the custom set showed that our proposed data modification modules
improved the model's generalization capabilities. We hope our findings will
provide future researchers with valuable insights and inspire them to develop
reliable systems for automatic differential diagnosis.
comment: 25 pages, 7 figures
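The multi-label framing above can be illustrated with an independent sigmoid per disease and a micro-F1 check. The logits and the 5-slot example below are toy values, not DDXPlus outputs:

```python
import numpy as np

def predict_labels(logits, threshold=0.5):
    """Multi-label prediction: one independent sigmoid per disease; keep every
    disease whose probability clears the threshold (a differential, not a
    single diagnosis)."""
    probs = 1.0 / (1.0 + np.exp(-logits))
    return (probs >= threshold).astype(int)

def micro_f1(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return 2 * tp / (2 * tp + fp + fn)

# Toy example using 5 of the 49 disease slots.
logits = np.array([[2.0, -3.0, 0.7, -1.5, 1.2]])
y_true = np.array([[1, 0, 1, 0, 0]])
y_pred = predict_labels(logits)
print(y_pred)                     # [[1 0 1 0 1]]
print(micro_f1(y_true, y_pred))   # 0.8
```

Treating each disease as its own binary decision is what lets the model return several plausible diagnoses at once, which is the essence of a differential.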
☆ Scaling Up Summarization: Leveraging Large Language Models for Long Text Extractive Summarization
In an era where digital text is proliferating at an unprecedented rate,
efficient summarization tools are becoming indispensable. While Large Language
Models (LLMs) have been successfully applied in various NLP tasks, their role
in extractive text summarization remains underexplored. This paper introduces
EYEGLAXS (Easy Yet Efficient larGe LAnguage model for eXtractive
Summarization), a framework that leverages LLMs, specifically LLAMA2-7B and
ChatGLM2-6B, for extractive summarization of lengthy text documents. Instead of
abstractive methods, which often suffer from issues like factual inaccuracies
and hallucinations, EYEGLAXS focuses on extractive summarization to ensure
factual and grammatical integrity. Utilizing state-of-the-art techniques such
as Flash Attention and Parameter-Efficient Fine-Tuning (PEFT), EYEGLAXS
addresses the computational and resource challenges typically associated with
LLMs. The system sets new performance benchmarks on well-known datasets like
PubMed and ArXiv. Furthermore, we extend our research through additional
analyses that explore the adaptability of LLMs in handling different sequence
lengths and their efficiency in training on smaller datasets. These
contributions not only set a new standard in the field but also open up
promising avenues for future research in extractive text summarization.
☆ Language Adaptation on a Tight Academic Compute Budget: Tokenizer Swapping Works and Pure bfloat16 Is Enough ICML 2024
We investigate continued pretraining of LLMs for language adaptation on a
tight academic budget: a setting in which only a few GPUs can be used in
parallel, for a heavily constrained duration. We focus on adapting Mistral-7B
to German or Arabic and evaluate several techniques to improve efficiency and
effectiveness in this setting. Our German models adapted on this tight compute
budget underperform compared to the base Mistral-7B, while our Arabic models
outperform several baselines, showing that for sufficiently well-represented
languages, continued pretraining for specialization is not always helpful. Our
main findings focus on training precision and tokenizer swapping. Our results
show that pure bfloat16 training is a viable alternative to mixed-precision
training, while being much faster when only using a few GPUs. Swapping the
tokenizer for a specialized one yields more efficient tokenization and is
competitive with the original tokenizer, which already contains some German
tokens, but did not significantly increase performance for German. Code and
model weights are available on GitHub.
comment: WANT@ICML 2024
☆ Interactive Agents: Simulating Counselor-Client Psychological Counseling via Role-Playing LLM-to-LLM Interactions
Virtual counselors powered by large language models (LLMs) aim to create
interactive support systems that effectively assist clients struggling with
mental health challenges. To replicate counselor-client conversations,
researchers have built an online mental health platform that allows
professional counselors to provide clients with text-based counseling services
for about an hour per session. Notwithstanding its effectiveness, challenges
exist, as human annotation is time-consuming, cost-intensive, constrained by
privacy protections, and not scalable. To address this issue and investigate the applicability of
LLMs in psychological counseling conversation simulation, we propose a
framework that employs two LLMs via role-playing for simulating
counselor-client interactions. Our framework involves two LLMs, one acting as a
client equipped with a specific and real-life user profile and the other
playing the role of an experienced counselor, generating professional responses
using integrative therapy techniques. We implement both the counselor and the
client by zero-shot prompting the GPT-4 model. In order to assess the
effectiveness of LLMs in simulating counselor-client interactions and
understand the disparities between LLM- and human-generated conversations, we
evaluate the synthetic data from various perspectives. We begin by assessing
the client's performance through automatic evaluations. Next, we analyze and
compare the disparities between dialogues generated by the LLM and those
generated by professional counselors. Furthermore, we conduct extensive
experiments to thoroughly examine the performance of our LLM-based counselor
trained with synthetic interactive dialogues by benchmarking against
state-of-the-art models for mental health.
☆ LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models
Jiayi Gui, Yiming Liu, Jiale Cheng, Xiaotao Gu, Xiao Liu, Hongning Wang, Yuxiao Dong, Jie Tang, Minlie Huang
Large Language Models (LLMs) have demonstrated notable capabilities across
various tasks, showcasing complex problem-solving abilities. Understanding and
executing complex rules, along with multi-step planning, are fundamental to
logical reasoning and critical for practical LLM agents and decision-making
systems. However, evaluating LLMs as effective rule-based executors and
planners remains underexplored. In this paper, we introduce LogicGame, a novel
benchmark designed to evaluate the comprehensive rule understanding, execution,
and planning capabilities of LLMs. Unlike traditional benchmarks, LogicGame
provides diverse games that contain a series of rules with an initial state,
requiring models to comprehend and apply predefined regulations to solve
problems. We create simulated scenarios in which models execute or plan
operations to achieve specific outcomes. These game scenarios are specifically
designed to distinguish logical reasoning from mere knowledge by relying
exclusively on predefined rules. This separation allows for a pure assessment
of rule-based reasoning capabilities. The evaluation considers not only final
outcomes but also intermediate steps, providing a comprehensive assessment of
model performance. Moreover, these intermediate steps are deterministic and can
be automatically verified. LogicGame defines game scenarios with varying
difficulty levels, from simple rule applications to complex reasoning chains,
in order to offer a precise evaluation of model performance on rule
understanding and multi-step execution. Utilizing LogicGame, we test various
LLMs and identify notable shortcomings in their rule-based logical reasoning
abilities.
☆ A Survey on Evaluation of Multimodal Large Language Models
Multimodal Large Language Models (MLLMs) mimic the human perception and
reasoning system by integrating powerful Large Language Models (LLMs) with various
modality encoders (e.g., vision, audio), positioning LLMs as the "brain" and
various modality encoders as sensory organs. This framework endows MLLMs with
human-like capabilities, and suggests a potential pathway towards achieving
artificial general intelligence (AGI). With the emergence of all-round MLLMs
like GPT-4V and Gemini, a multitude of evaluation methods have been developed
to assess their capabilities across different dimensions. This paper presents a
systematic and comprehensive review of MLLM evaluation methods, covering the
following key aspects: (1) the background of MLLMs and their evaluation; (2)
"what to evaluate" that reviews and categorizes existing MLLM evaluation tasks
based on the capabilities assessed, including general multimodal recognition,
perception, reasoning and trustworthiness, and domain-specific applications
such as socioeconomic, natural sciences and engineering, medical usage, AI
agent, remote sensing, video and audio processing, 3D point cloud analysis, and
others; (3) "where to evaluate" that summarizes MLLM evaluation benchmarks into
general and specific benchmarks; and (4) "how to evaluate" that reviews and
illustrates MLLM evaluation steps and metrics. Our overarching goal is to
provide valuable insights for researchers in the field of MLLM evaluation,
thereby facilitating the development of more capable and reliable MLLMs. We
emphasize that evaluation should be regarded as a critical discipline,
essential for advancing the field of MLLMs.
☆ Harmonized Speculative Sampling
Speculative sampling has proven to be an effective solution to accelerate
decoding from large language models, where the acceptance rate significantly
determines the performance. Most previous works on improving the acceptance
rate focus on aligned training and efficient decoding, implicitly paying less
attention to the linkage of training and decoding. In this work, we first
investigate the linkage of training and decoding for speculative sampling and
then propose a solution named HArmonized Speculative Sampling (HASS). HASS
improves the acceptance rate without extra inference overhead by harmonizing
training and decoding on their objectives and contexts. Experiments on three
LLaMA models demonstrate that HASS achieves a 2.81x-3.65x wall-clock time
speedup averaged across three datasets, which is 8%-15% faster than EAGLE-2.
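HASS's specific harmonization of training and decoding is not detailed in the abstract, but the acceptance rule whose rate it improves is the standard speculative-sampling rule (accept a draft token with probability min(1, p_target/p_draft), otherwise resample from the normalized residual), which can be sketched as:

```python
import numpy as np

def accept_or_resample(token, p_draft, p_target, rng):
    """One standard speculative-sampling step: accept the drafted token with
    probability min(1, p_target[t] / p_draft[t]); on rejection, resample from
    the normalized positive part of (p_target - p_draft). This preserves the
    target distribution exactly."""
    if rng.random() < min(1.0, p_target[token] / p_draft[token]):
        return token, True
    residual = np.maximum(p_target - p_draft, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p_target), p=residual), False

rng = np.random.default_rng(0)
p_draft = np.array([0.5, 0.3, 0.2])      # draft model's next-token distribution
p_target = np.array([0.6, 0.1, 0.3])     # target model's distribution
accepts = sum(accept_or_resample(rng.choice(3, p=p_draft), p_draft, p_target, rng)[1]
              for _ in range(10_000))
print(accepts)                           # roughly 8000: rate = sum(min(p_d, p_t)) = 0.8
```

The expected acceptance rate equals the overlap between the two distributions, which is why methods like HASS focus on making the draft model's distribution (and its decoding context) match the target's.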
☆ Form and meaning co-determine the realization of tone in Taiwan Mandarin spontaneous speech: the case of Tone 3 sandhi
In Standard Chinese, Tone 3 (the dipping tone) becomes Tone 2 (rising tone)
when followed by another Tone 3. Previous studies have noted that this sandhi
process may be incomplete, in the sense that the assimilated Tone 3 is still
distinct from a true Tone 2. While Mandarin Tone 3 sandhi is widely studied
using carefully controlled laboratory speech (Xu, 1997) and more formal
registers of Beijing Mandarin (Yuan and Chen, 2014), less is known about its
realization in spontaneous speech, and about the effect of contextual factors
on tonal realization. The present study investigates the pitch contours of
two-character words with T2-T3 and T3-T3 tone patterns in spontaneous Taiwan
Mandarin conversations. Our analysis makes use of the Generalized Additive Mixed
Model (GAMM; Wood, 2017) to examine fundamental frequency (f0) contours as a
function of normalized time. We consider various factors known to influence
pitch contours, including gender, speaking rate, speaker, neighboring tones,
word position, bigram probability, and also novel predictors, word and word
sense (Chuang et al., 2024). Our analyses revealed that in spontaneous Taiwan
Mandarin, T3-T3 words become indistinguishable from T2-T3 words, indicating
complete sandhi, once the strong effect of word (or word sense) is taken into
account. For our data, the shape of f0 contours is not co-determined by word
frequency. In contrast, the effect of word meaning on f0 contours is robust, as
strong as the effect of adjacent tones, and is present for both T2-T3 and T3-T3
words.
☆ LM-PUB-QUIZ: A Comprehensive Framework for Zero-Shot Evaluation of Relational Knowledge in Language Models
Knowledge probing evaluates the extent to which a language model (LM) has
acquired relational knowledge during its pre-training phase. It provides a
cost-effective means of comparing LMs of different sizes and training setups
and is useful for monitoring knowledge gained or lost during continual learning
(CL). In prior work, we presented an improved knowledge probe called BEAR
(Wiland et al., 2024), which enables the comparison of LMs trained with
different pre-training objectives (causal and masked LMs) and addresses issues
of skewed distributions in previous probes to deliver a more unbiased reading
of LM knowledge. With this paper, we present LM-PUB-QUIZ, a Python framework
and leaderboard built around the BEAR probing mechanism that enables
researchers and practitioners to apply it in their work. It provides options
for standalone evaluation and direct integration into the widely-used training
pipeline of the Hugging Face TRANSFORMERS library. Further, it provides a
fine-grained analysis of different knowledge types to assist users in better
understanding the knowledge in each evaluated LM. We publicly release
LM-PUB-QUIZ as an open-source project.
☆ An Evaluation of Sindhi Word Embedding in Semantic Analogies and Downstream Tasks
In this paper, we propose new word embeddings based on a corpus consisting of
more than 61 million words crawled from multiple web resources. We design a
preprocessing pipeline for the filtration of unwanted text from crawled data.
Afterwards, the cleaned vocabulary is fed to state-of-the-art
continuous-bag-of-words, skip-gram, and GloVe word embedding algorithms. For
the evaluation of pretrained embeddings, we use popular intrinsic and extrinsic
evaluation approaches. The evaluation results reveal that
continuous-bag-of-words and skip-gram perform better than GloVe and the
existing Sindhi fastText word embeddings on both intrinsic and extrinsic
evaluation approaches.
comment: arXiv admin note: substantial text overlap with arXiv:1911.12579
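The intrinsic semantic-analogy evaluation mentioned above follows the classic vector-offset test (a - b + c should land near d). A toy sketch with hand-made 2D embeddings (not the Sindhi vectors):

```python
import numpy as np

def analogy(emb, a, b, c):
    """Return the vocabulary word whose vector is most cosine-similar to
    emb[a] - emb[b] + emb[c], excluding the three query words."""
    target = emb[a] - emb[b] + emb[c]
    target /= np.linalg.norm(target)
    best, best_sim = None, -np.inf
    for word, vec in emb.items():
        if word in (a, b, c):
            continue
        sim = vec @ target / np.linalg.norm(vec)
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# Hand-made 2D toy embeddings: dim 0 ~ "royalty", dim 1 ~ "gender".
emb = {"king": np.array([1.0, 1.0]), "man": np.array([0.0, 1.0]),
       "woman": np.array([0.0, -1.0]), "queen": np.array([1.0, -1.0]),
       "apple": np.array([-1.0, 0.0])}
print(analogy(emb, "king", "man", "woman"))   # queen
```

Accuracy over a benchmark of such analogy quadruples is the usual intrinsic score that the paper reports for CBOW, skip-gram, and GloVe.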
☆ Conan-embedding: General Text Embedding with More and Better Negative Samples
With the growing popularity of RAG, the capabilities of embedding models are
gaining increasing attention. Embedding models are primarily trained through
contrastive loss learning, with negative examples being a key component.
Previous work has proposed various hard negative mining strategies, but these
strategies are typically employed as preprocessing steps. In this paper, we
propose the conan-embedding model, which maximizes the utilization of more and
higher-quality negative examples. Specifically, since the model's ability to
handle preprocessed negative examples evolves during training, we propose a
dynamic hard negative mining method to expose the model to more challenging
negative examples throughout the training process. Secondly, contrastive
learning requires as many negative examples as possible but is limited by GPU
memory constraints. Therefore, we use a Cross-GPU balancing Loss to provide
more negative examples for embedding training and balance the batch size across
multiple tasks. Moreover, we also discovered that the prompt-response pairs
from LLMs can be used for embedding training. Our approach effectively enhances
the capabilities of embedding models, currently ranking first on the Chinese
leaderboard of the Massive Text Embedding Benchmark.
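Contrastive training with mined negatives rests on an InfoNCE-style loss; a toy sketch showing why harder negatives produce a stronger training signal (the vectors and temperature are illustrative, not Conan-embedding's exact setup):

```python
import numpy as np

def info_nce(query, positive, negatives, temperature=0.05):
    """InfoNCE loss for one query: cosine similarity against the positive and
    a pool of (hard) negatives, softmax over all candidates, negative log
    probability of the positive."""
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    sims = np.array([cos(query, positive)] + [cos(query, n) for n in negatives])
    logits = sims / temperature
    logits -= logits.max()                    # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

rng = np.random.default_rng(1)
q = rng.normal(size=16)
easy_negs = [rng.normal(size=16) for _ in range(4)]            # random negatives
hard_negs = [q + 0.5 * rng.normal(size=16) for _ in range(4)]  # close to the query
loss_easy = info_nce(q, q, easy_negs)
loss_hard = info_nce(q, q, hard_negs)
print(loss_hard > loss_easy)   # True: hard negatives yield larger loss/gradients
```

Easy negatives are quickly driven to near-zero loss, which is why re-mining hard negatives as the model improves (the dynamic mining above) keeps the gradient signal informative.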
☆ TempoFormer: A Transformer for Temporally-aware Representations in Change Detection
Dynamic representation learning plays a pivotal role in understanding the
evolution of linguistic content over time. On this front both context and time
dynamics as well as their interplay are of prime importance. Current approaches
model context via pre-trained representations, which are typically temporally
agnostic. Previous work on modeling context and temporal dynamics has used
recurrent methods, which are slow and prone to overfitting. Here we introduce
TempoFormer, the first task-agnostic transformer-based and temporally-aware
model for dynamic representation learning. Our approach is jointly trained on
inter- and intra-context dynamics and introduces a novel temporal variation of
rotary positional embeddings. The architecture is flexible and can be used as
the temporal representation foundation of other models or applied to different
transformer-based architectures. We show new SOTA performance on three
different real-time change detection tasks.
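The abstract does not define the temporal variation of rotary positional embeddings; one speculative reading replaces the token index in standard RoPE with a timestamp, so attention scores depend on elapsed time rather than token position. This is an assumption for illustration, not TempoFormer's exact formulation:

```python
import numpy as np

def rope(x, t, base=10000.0):
    """Rotary embedding of one vector x (even dim): each 2D pair is rotated by
    angle t / base**(2i/d). In standard RoPE t is the token index; using a
    timestamp instead is a speculative 'temporal' variant."""
    d = x.shape[0]
    half = x.reshape(d // 2, 2)
    freqs = base ** (-np.arange(d // 2) * 2.0 / d)
    cos, sin = np.cos(t * freqs), np.sin(t * freqs)
    rot = np.empty_like(half)
    rot[:, 0] = half[:, 0] * cos - half[:, 1] * sin
    rot[:, 1] = half[:, 0] * sin + half[:, 1] * cos
    return rot.reshape(d)

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
# The defining RoPE property carries over: the query-key score depends only
# on the time *difference*, not on absolute timestamps.
s1 = rope(q, 5.0) @ rope(k, 2.0)
s2 = rope(q, 105.0) @ rope(k, 102.0)
print(np.isclose(s1, s2))   # True
```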
☆ StyleRemix: Interpretable Authorship Obfuscation via Distillation and Perturbation of Style Elements
Authorship obfuscation, rewriting a text to intentionally obscure the
identity of the author, is an important but challenging task. Current methods
using large language models (LLMs) lack interpretability and controllability,
often ignoring author-specific stylistic features, resulting in less robust
performance overall.
To address this, we develop StyleRemix, an adaptive and interpretable
obfuscation method that perturbs specific, fine-grained style elements of the
original input text. StyleRemix uses pre-trained Low Rank Adaptation (LoRA)
modules to rewrite an input specifically along various stylistic axes (e.g.,
formality and length) while maintaining low computational cost. StyleRemix
outperforms state-of-the-art baselines and much larger LLMs in a variety of
domains as assessed by both automatic and human evaluation.
Additionally, we release AuthorMix, a large set of 30K high-quality,
long-form texts from a diverse set of 14 authors and 4 domains, and DiSC, a
parallel corpus of 1,500 texts spanning seven style axes in 16 unique
directions.
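The adaptive part of the method can be sketched as ranking style axes by how far an author deviates from a corpus baseline, then applying the LoRA adapters for the top axes. The axis names and profile values below are hypothetical illustrations, not taken from the paper:

```python
def choose_style_axes(author_profile, corpus_mean, top_n=2):
    """Rank style axes by the author's absolute deviation from the corpus
    average; the top axes select which LoRA adapters to apply."""
    deviation = {ax: abs(author_profile[ax] - corpus_mean[ax])
                 for ax in author_profile}
    return sorted(deviation, key=deviation.get, reverse=True)[:top_n]
```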
☆ Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts
For Mixture-of-Experts (MoE) models, an unbalanced expert load will lead to
routing collapse or increased computational overhead. Existing methods commonly
employ an auxiliary loss to encourage load balance, but a large auxiliary loss
will introduce non-negligible interference gradients into training and thus
impair the model performance. In order to control load balance while not
producing undesired gradients during training, we propose Loss-Free Balancing,
which features an auxiliary-loss-free load balancing strategy. To be specific,
before the top-K routing decision, Loss-Free Balancing will first apply an
expert-wise bias to the routing scores of each expert. By dynamically updating
the bias of each expert according to its recent load, Loss-Free Balancing can
consistently maintain a balanced distribution of expert load. In addition,
since Loss-Free Balancing does not produce any interference gradients, it also
elevates the upper bound of model performance gained from MoE training. We
validate the performance of Loss-Free Balancing on MoE models with up to 3B
parameters trained on up to 200B tokens. Experimental results show that
Loss-Free Balancing achieves both better performance and better load balance
compared with traditional auxiliary-loss-controlled load balancing strategies.
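The mechanism described above can be sketched in a few lines: routing uses bias-adjusted scores, and after each batch each expert's bias is nudged against its load imbalance. The sign-based, fixed-step update rule here is an illustrative assumption, not necessarily the paper's exact rule:

```python
def route_top_k(scores, bias, k):
    """Pick top-k experts by bias-adjusted score; gating weights would still
    come from the raw scores, so the bias only steers routing."""
    ranked = sorted(range(len(scores)),
                    key=lambda e: scores[e] + bias[e], reverse=True)
    return ranked[:k]

def update_bias(bias, load, target_load, step=0.01):
    """Nudge overloaded experts down and underloaded experts up after a batch."""
    return [b - step * (1 if l > target_load else -1 if l < target_load else 0)
            for b, l in zip(bias, load)]
```

Because the bias never enters the loss, no interference gradient flows back into training; balance is maintained purely by the online bias updates.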
☆ Harnessing the Intrinsic Knowledge of Pretrained Language Models for Challenging Text Classification Settings
Text classification is crucial for applications such as sentiment analysis
and toxic text filtering, but it still faces challenges due to the complexity
and ambiguity of natural language. Recent advancements in deep learning,
particularly transformer architectures and large-scale pretraining, have
achieved inspiring success in NLP fields. Building on these advancements, this
thesis explores three challenging settings in text classification by leveraging
the intrinsic knowledge of pretrained language models (PLMs). Firstly, to
address the challenge of selecting misleading yet incorrect distractors for
cloze questions, we develop models that utilize features based on
contextualized word representations from PLMs, achieving performance that
rivals or surpasses human accuracy. Secondly, to enhance model generalization
to unseen labels, we create small finetuning datasets with domain-independent
task label descriptions, improving model performance and robustness. Lastly, we
tackle the sensitivity of large language models to in-context learning prompts
by selecting effective demonstrations, focusing on misclassified examples and
resolving model ambiguity regarding test example labels.
comment: PhD thesis
☆ CBF-LLM: Safe Control for LLM Alignment
This paper proposes a control-based framework for aligning large language
models (LLMs) by leveraging a control barrier function (CBF) to ensure
user-desirable text generation. The presented framework applies the safety
filter, designed based on the CBF, to the output generation of the baseline
LLM, i.e., the token sequence, with the aim of intervening in the
generated text. The overall text-generation system is implemented with Llama 3
and a RoBERTa model, and the source code is available at
https://github.com/Mya-Mya/CBF-LLM. The experiment demonstrates its control
ability and effectiveness in reducing the number of interventions needed for
user-specified alignment tasks.
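At token level, a CBF-style safety filter can be caricatured as masking next-token candidates whose safety score would violate the barrier condition, falling back to the unfiltered distribution if nothing is safe. The scorer and threshold below are stand-ins for illustration, not the paper's RoBERTa-based design:

```python
import math

def filtered_next_token(logits, safety_scores, threshold=0.0):
    """Greedy decoding restricted to tokens with safety score >= threshold,
    a rough analogue of enforcing h(x) >= 0 at each generation step."""
    masked = [l if s >= threshold else -math.inf
              for l, s in zip(logits, safety_scores)]
    if all(v == -math.inf for v in masked):  # nothing is safe: fall back
        masked = logits
    return max(range(len(masked)), key=lambda i: masked[i])
```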
☆ Beyond Levenshtein: Leveraging Multiple Algorithms for Robust Word Error Rate Computations And Granular Error Classifications INTERSPEECH 2024
The Word Error Rate (WER) is the common measure of accuracy for Automatic
Speech Recognition (ASR). Transcripts are usually pre-processed by substituting
specific characters to account for non-semantic differences. As a result of
this normalisation, information on the accuracy of punctuation or
capitalisation is lost. We present a non-destructive, token-based approach
using an extended Levenshtein distance algorithm to compute a robust WER and
additional orthographic metrics. Transcription errors are also classified more
granularly by existing string similarity and phonetic algorithms. An evaluation
on several datasets demonstrates the practical equivalence of our approach
compared to common WER computations. We also provide an exemplary analysis of
derived use cases, such as a punctuation error rate, and a web application for
interactive use and visualisation of our implementation. The code is available
open-source.
comment: Accepted in INTERSPEECH 2024
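A token-based WER via the classic Levenshtein recurrence, plus a toy substitution classifier that separates orthographic (punctuation/capitalisation) differences from semantic ones, gives the flavour of the approach; the paper's extended algorithm and phonetic classification go well beyond this sketch:

```python
import string

def token_wer(ref, hyp):
    """Word Error Rate from the token-level Levenshtein distance."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + cost,  # substitution / match
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
    return d[m][n] / m

def classify_substitution(ref_tok, hyp_tok):
    """'orthographic' if the tokens agree after lowercasing and stripping
    punctuation, else 'semantic'."""
    norm = lambda t: t.lower().strip(string.punctuation)
    return "orthographic" if norm(ref_tok) == norm(hyp_tok) else "semantic"
```

Keeping tokens unnormalised in the alignment is what makes the approach non-destructive: orthographic errors remain visible instead of being erased by pre-processing.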
☆ SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models
There is a growing trend of teaching large language models (LLMs) to solve
mathematical problems through coding. Existing studies primarily focus on
prompting powerful, closed-source models to generate seed training data
followed by in-domain data augmentation, equipping LLMs with considerable
capabilities for code-aided mathematical reasoning. However, continually
training these models on augmented data derived from a few datasets such as
GSM8K may impair their generalization abilities and restrict their
effectiveness to a narrow range of question types. Conversely, the potential of
improving such LLMs by leveraging large-scale, expert-written, diverse math
question-answer pairs remains unexplored. To utilize these resources and tackle
unique challenges such as code response assessment, we propose a novel paradigm
that uses a code-based critic model to guide steps including question-code data
construction, quality control, and complementary evaluation. We also explore
different alignment algorithms with self-generated instruction/preference data
to foster continuous improvement. Experiments across both in-domain (up to
+5.7%) and out-of-domain (+4.4%) benchmarks in English and Chinese demonstrate
the effectiveness of the proposed paradigm.
☆ Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation AAAI 2025
Lossless speculative decoding accelerates target large language model (LLM)
inference by employing a lightweight draft model for generating tree-structured
candidates, which are subsequently verified in parallel by the target LLM.
Currently, effective approaches leverage feature-level rather than token-level
autoregression within the draft model to facilitate more straightforward
predictions and enhanced knowledge distillation. In this paper, we reassess
these approaches and propose FSPAD (Feature Sampling and Partial Alignment
Distillation for Lossless Speculative Decoding), which introduces two
straightforward and effective components within the existing framework to boost
lossless speculative decoding. Firstly, FSPAD utilizes token embeddings to
sample features of the target LLM in high-dimensional space before feeding them
into the draft model, since the inherent uncertainty of the features would
otherwise prevent the draft model from recovering the specific token output by
the target LLM. Secondly, FSPAD introduces partial alignment distillation to weaken
the draft model's connection between features and logits, aiming to reduce the
conflict between feature alignment and logit confidence during training. Our
experiments include both greedy and non-greedy decoding on the largest and
smallest models from the Vicuna and LLaMA3-Instruct series, as well as tasks in
multi-turn conversation, translation, summarization, question answering,
mathematical reasoning, and retrieval-augmented generation. The results show
that FSPAD outperforms the state-of-the-art method across all the
aforementioned tasks and target LLMs.
comment: The work was not submitted to AAAI 2025
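The lossless verification step common to this line of work (shown here for the greedy case) accepts draft tokens while they match what the target would have produced, and replaces the first mismatch with the target's own token. This sketch abstracts the target as a next-token function and omits the tree-structured candidates and feature-level machinery:

```python
def verify_draft(target_next, prompt, draft):
    """Accept the longest draft prefix matching greedy target decoding;
    on the first mismatch, emit the target's token instead and stop."""
    ctx, accepted = list(prompt), []
    for tok in draft:
        expected = target_next(ctx)
        accepted.append(expected)
        if tok != expected:
            break
        ctx.append(tok)
    return accepted
```

Every accepted token is exactly what the target would have generated, which is why the scheme is lossless: speedup comes only from verifying several positions per target forward pass.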
☆ WildFeedback: Aligning LLMs With In-situ User Interactions And Feedback
Taiwei Shi, Zhuoer Wang, Longqi Yang, Ying-Chun Lin, Zexue He, Mengting Wan, Pei Zhou, Sujay Jauhar, Xiaofeng Xu, Xia Song, Jennifer Neville
As large language models (LLMs) continue to advance, aligning these models
with human preferences has emerged as a critical challenge. Traditional
alignment methods, relying on human or LLM annotated datasets, are limited by
their resource-intensive nature, inherent subjectivity, and the risk of
feedback loops that amplify model biases. To overcome these limitations, we
introduce WildFeedback, a novel framework that leverages real-time, in-situ
user interactions to create preference datasets that more accurately reflect
authentic human values. WildFeedback operates through a three-step process:
feedback signal identification, preference data construction, and user-guided
evaluation. We applied this framework to a large corpus of user-LLM
conversations, resulting in a rich preference dataset that reflects genuine
user preferences. This dataset captures the nuances of user preferences by
identifying and classifying feedback signals within natural conversations,
thereby enabling the construction of more representative and context-sensitive
alignment data. Our extensive experiments demonstrate that LLMs fine-tuned on
WildFeedback exhibit significantly improved alignment with user preferences, as
evidenced by both traditional benchmarks and our proposed user-guided
evaluation. By incorporating real-time feedback from actual users, WildFeedback
addresses the scalability, subjectivity, and bias challenges that plague
existing approaches, marking a significant step toward developing LLMs that are
more responsive to the diverse and evolving needs of their users. In summary,
WildFeedback offers a robust, scalable solution for aligning LLMs with true
human values, setting a new standard for the development and evaluation of
user-centric language models.
comment: 24 pages
☆ SciLitLLM: How to Adapt LLMs for Scientific Literature Understanding
Sihang Li, Jian Huang, Jiaxi Zhuang, Yaorui Shi, Xiaochen Cai, Mingjun Xu, Xiang Wang, Linfeng Zhang, Guolin Ke, Hengxing Cai
Scientific literature understanding is crucial for extracting targeted
information and garnering insights, thereby significantly advancing scientific
discovery. Despite the remarkable success of Large Language Models (LLMs), they
face challenges in scientific literature understanding, primarily due to (1) a
lack of scientific knowledge and (2) unfamiliarity with specialized scientific
tasks.
To develop an LLM specialized in scientific literature understanding, we
propose a hybrid strategy that integrates continual pre-training (CPT) and
supervised fine-tuning (SFT), to simultaneously infuse scientific domain
knowledge and enhance instruction-following capabilities for domain-specific
tasks. In this process, we identify two key challenges: (1) constructing
high-quality CPT corpora, and (2) generating diverse SFT instructions. We
address these challenges through a meticulous pipeline, including PDF text
extraction, parsing content error correction, quality filtering, and synthetic
instruction creation. Applying this strategy, we present a suite of LLMs:
SciLitLLM, specialized in scientific literature understanding. These models
demonstrate promising performance on scientific literature understanding
benchmarks.
Our contributions are threefold: (1) We present an effective framework that
integrates CPT and SFT to adapt LLMs to scientific literature understanding,
which can also be easily adapted to other domains. (2) We propose an LLM-based
synthesis method to generate diverse and high-quality scientific instructions,
resulting in a new instruction set -- SciLitIns -- for supervised fine-tuning
in less-represented scientific domains. (3) SciLitLLM achieves promising
performance improvements on scientific literature understanding benchmarks.
☆ An Investigation of Warning Erroneous Chat Translations in Cross-lingual Communication
The complexities of chats pose significant challenges for machine translation
models. Recognizing the need for a precise evaluation metric to address the
issues of chat translation, this study introduces Multidimensional Quality
Metrics for Chat Translation (MQM-Chat). Through the experiments of five models
using MQM-Chat, we observed that all models generated certain fundamental
errors, while each of them had different shortcomings, such as omission, overly
correcting ambiguous source content, and buzzword issues, resulting in the loss
of stylized information. Our findings underscore the effectiveness of MQM-Chat
in evaluating chat translation, emphasizing the importance of stylized content
and dialogue consistency for future studies.
☆ LRP4RAG: Detecting Hallucinations in Retrieval-Augmented Generation via Layer-wise Relevance Propagation
Retrieval-Augmented Generation (RAG) has become a primary technique for
mitigating hallucinations in large language models (LLMs). However, incomplete
knowledge extraction and insufficient understanding can still mislead LLMs to
produce irrelevant or even contradictory responses, which means hallucinations
persist in RAG. In this paper, we propose LRP4RAG, a method based on the
Layer-wise Relevance Propagation (LRP) algorithm for detecting hallucinations
in RAG. Specifically, we first utilize LRP to compute the relevance between the
input and output of the RAG generator. We then apply further extraction and
resampling to the relevance matrix. The processed relevance data are input into
multiple classifiers to determine whether the output contains hallucinations.
To the best of our knowledge, this is the first time that LRP has been used for
detecting RAG hallucinations, and extensive experiments demonstrate that
LRP4RAG outperforms existing baselines.
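The pipeline above, pooling the relevance matrix into features and then classifying, might be sketched as follows; the pooled features and threshold rule are illustrative stand-ins for the paper's resampling step and trained classifiers:

```python
def relevance_features(rel):
    """Pool an (input_len x output_len) relevance matrix into fixed features."""
    flat = [v for row in rel for v in row]
    column_mass = [sum(row[j] for row in rel) for j in range(len(rel[0]))]
    return {"mean": sum(flat) / len(flat),
            "min_column_mass": min(column_mass)}  # least-grounded output slot

def flag_hallucination(feats, tau=0.2):
    """Toy decision rule: an output position with little input relevance
    behind it is treated as a hallucination signal."""
    return feats["min_column_mass"] < tau
```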
☆ Dolphin: Long Context as a New Modality for Energy-Efficient On-Device Language Models
This paper presents Dolphin, a novel decoder-decoder architecture for
energy-efficient processing of long contexts in language models. Our approach
addresses the significant energy consumption and latency challenges inherent in
on-device models. Dolphin employs a compact 0.5B parameter decoder to distill
extensive contextual information into a memory embedding, substantially
reducing the input length for the primary 7B parameter decoder model. Inspired
by vision-language models, we repurpose the image embedding projector to encode
long textual contexts, effectively treating extended context as a distinct
modality. This innovative method enables processing of substantially longer
contexts without the typical computational overhead associated with extended
input sequences. Empirical evaluations demonstrate a 10-fold improvement in
energy efficiency and a 5-fold reduction in latency compared to conventional
full-length context processing methods without loss of response quality.
Our work contributes to the development of more sustainable and scalable
language models for on-device applications, addressing the critical need for
energy-efficient and responsive AI technologies in resource-constrained
environments while maintaining the accuracy to understand long contexts. This
research has implications for the broader field of natural language processing,
particularly in the domain of efficient model design for resource-limited
settings. By enabling more sophisticated AI capabilities on edge devices,
Dolphin paves the way for advanced language processing in a wide range of
applications where computational resources are at a premium. The Dolphin model
is publicly available at https://huggingface.co/NexaAIDev/Dolphin.
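The core idea, distilling a long context into a handful of memory embeddings that the 7B decoder consumes in place of thousands of tokens, can be caricatured with chunked mean-pooling standing in for the learned 0.5B encoder and projector:

```python
def pool_to_memory(token_embs, n_memory):
    """Compress a long sequence of context embeddings into n_memory slots
    by chunked mean-pooling (stand-in for the compact learned decoder)."""
    chunk = max(1, len(token_embs) // n_memory)
    memory = []
    for i in range(0, len(token_embs), chunk):
        block = token_embs[i:i + chunk]
        dim = len(block[0])
        memory.append([sum(v[d] for v in block) / len(block)
                       for d in range(dim)])
    return memory[:n_memory]
```

The energy and latency savings follow directly from the input-length reduction: the large decoder attends over `n_memory` slots instead of the full context.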
☆ Towards Fully Autonomous Research Powered by LLMs: Case Study on Simulations
The advent of Large Language Models (LLMs) has created new opportunities for
the automation of scientific research, spanning both experimental processes and
computational simulations. This study explores the feasibility of constructing
an autonomous simulation agent (ASA) powered by an LLM, through sophisticated API
integration, to automate the entire research process, from experimental design,
remote upload and simulation execution, data analysis, to report compilation.
Using a simulation problem of polymer chain conformations as a case study, we
assessed the performance of ASAs powered by different LLMs including
GPT-4-Turbo. Our findings revealed that ASA-GPT-4o achieved near-flawless
execution on designated research missions, underscoring the potential of LLMs
to manage complete scientific investigations autonomously. The outlined
automation can be iteratively performed for up to twenty cycles without human
intervention, illustrating the potential of LLMs for large-scale autonomous
research endeavors. Additionally, we discussed the intrinsic traits of ASAs in
managing extensive tasks, focusing on self-validation mechanisms and the
balance between local attention and global oversight.
comment: For additional code and data, please visit our GitHub repository:
https://github.com/zokaraa/autonomous_simulation_agent
☆ Measuring the Reliability of Causal Probing Methods: Tradeoffs, Limitations, and the Plight of Nullifying Interventions
Causal probing is an approach to interpreting foundation models, such as
large language models, by training probes to recognize latent properties of
interest from embeddings, intervening on probes to modify this representation,
and analyzing the resulting changes in the model's behavior. While some recent
works have cast doubt on the theoretical basis of several leading causal
probing intervention methods, it has been unclear how to systematically and
empirically evaluate their effectiveness in practice. To address this problem,
we propose a general empirical analysis framework to evaluate the reliability
of causal probing interventions, formally defining and quantifying two key
causal probing desiderata: completeness (fully transforming the representation
of the target property) and selectivity (minimally impacting other properties).
Our formalism allows us to make the first direct comparisons between different
families of causal probing methods (e.g., linear vs. nonlinear or
counterfactual vs. nullifying interventions). We conduct extensive experiments
across several leading methods, finding that (1) there is an inherent tradeoff
between these criteria, and no method is able to consistently satisfy both at
once; and (2) across the board, nullifying interventions are always far less
complete than counterfactual interventions, indicating that nullifying methods
may not be an effective approach to causal probing.
☆ ReMamba: Equip Mamba with Effective Long-Sequence Modeling
While the Mamba architecture demonstrates superior inference efficiency and
competitive performance on short-context natural language processing (NLP)
tasks, empirical evidence suggests its capacity to comprehend long contexts is
limited compared to transformer-based models. In this study, we investigate the
long-context efficiency issues of the Mamba models and propose ReMamba, which
enhances Mamba's ability to comprehend long contexts. ReMamba incorporates
selective compression and adaptation techniques within a two-stage re-forward
process, incurring minimal additional inference overhead. Experimental
results on the LongBench and L-Eval benchmarks demonstrate ReMamba's efficacy,
improving over the baselines by 3.2 and 1.6 points, respectively, and attaining
performance almost on par with same-size transformer models.
☆ Enhancing and Accelerating Large Language Models via Instruction-Aware Contextual Compression
Large Language Models (LLMs) have garnered widespread attention due to their
remarkable performance across various tasks. However, to mitigate the issue of
hallucinations, LLMs often incorporate a retrieval-augmented pipeline to provide
them with rich external knowledge and context. Nevertheless, challenges stem
from inaccurate and coarse-grained context retrieved from the retriever.
Supplying irrelevant context to the LLMs can result in poorer responses,
increased inference latency, and higher costs. This paper introduces a method
called Instruction-Aware Contextual Compression, which filters out less
informative content, thereby accelerating and enhancing the use of LLMs. The
experimental results demonstrate that Instruction-Aware Contextual Compression
notably reduces memory consumption and minimizes generation latency while
maintaining performance levels comparable to those achieved with the use of the
full context. Specifically, we achieved a 50% reduction in context-related
costs, resulting in a 5% reduction in inference memory usage and a 2.2-fold
increase in inference speed, with only a minor drop of 0.047 in Rouge-1. These
findings suggest that our method strikes an effective balance between
efficiency and performance.
comment: 20 pages
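Instruction-aware filtering can be sketched with a trivial lexical scorer in place of the learned compressor: keep only the context sentences most relevant to the instruction. The scorer and keep ratio below are assumptions for illustration:

```python
def compress_context(instruction, sentences, keep_ratio=0.5):
    """Keep the top fraction of sentences by token overlap with the
    instruction, preserving their original order."""
    query = set(instruction.lower().split())
    ranked = sorted(sentences,
                    key=lambda s: -len(query & set(s.lower().split())))
    kept = set(ranked[:max(1, int(len(sentences) * keep_ratio))])
    return [s for s in sentences if s in kept]
```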
☆ Legilimens: Practical and Unified Content Moderation for Large Language Model Services CCS
Given the societal impact of unsafe content generated by large language
models (LLMs), ensuring that LLM services comply with safety standards is a
crucial concern for LLM service providers. Common content moderation methods
are limited by an effectiveness-and-efficiency dilemma, where simple models are
fragile while sophisticated models consume excessive computational resources.
In this paper, we reveal for the first time that effective and efficient
content moderation can be achieved by extracting conceptual features from
chat-oriented LLMs, despite their initial fine-tuning for conversation rather
than content moderation. We propose a practical and unified content moderation
framework for LLM services, named Legilimens, which features both effectiveness
and efficiency. Our red-team model-based data augmentation enhances the
robustness of Legilimens against state-of-the-art jailbreaking. Additionally,
we develop a framework to theoretically analyze the cost-effectiveness of
Legilimens compared to other methods. We have conducted extensive experiments
on five host LLMs, seventeen datasets, and nine jailbreaking methods to verify
the effectiveness, efficiency, and robustness of Legilimens against normal and
adaptive adversaries. A comparison of Legilimens with both commercial and
academic baselines demonstrates the superior performance of Legilimens.
Furthermore, we confirm that Legilimens can be applied to few-shot scenarios
and extended to multi-label classification tasks.
comment: Accepted by ACM Conference on Computer and Communications Security
(CCS) 2024
♻ ☆ Flextron: Many-in-One Flexible Large Language Model
Ruisi Cai, Saurav Muralidharan, Greg Heinrich, Hongxu Yin, Zhangyang Wang, Jan Kautz, Pavlo Molchanov
Training modern LLMs is extremely resource intensive, and customizing them
for various deployment scenarios characterized by limited compute and memory
resources through repeated training is impractical. In this paper, we introduce
Flextron, a network architecture and post-training model optimization framework
supporting flexible model deployment. The Flextron architecture utilizes a
nested elastic structure to rapidly adapt to specific user-defined latency and
accuracy targets during inference with no additional fine-tuning required. It
is also input-adaptive, and can automatically route tokens through its
sub-networks for improved performance and efficiency. We present a
sample-efficient training method and associated routing algorithms for
systematically transforming an existing trained LLM into a Flextron model. We
evaluate Flextron on the GPT-3 and LLama-2 family of LLMs, and demonstrate
superior performance over multiple end-to-end trained variants and other
state-of-the-art elastic networks, all with a single pretraining run that
consumes a mere 7.63% of the tokens used in the original pretraining.
♻ ☆ Towards Human-Level Text Coding with LLMs: The Case of Fatherhood Roles in Public Policy Documents
Recent advances in large language models (LLMs) like GPT-3.5 and GPT-4
promise automation with better results and less programming, opening up new
opportunities for text analysis in political science. In this study, we
evaluate LLMs on three original coding tasks involving typical complexities
encountered in political science settings: a non-English language, legal and
political jargon, and complex labels based on abstract constructs. Throughout
the paper, we propose a practical workflow to optimize the choice of the model and
the prompt. We find that the best prompting strategy consists of providing the
LLMs with a detailed codebook, like the one provided to human coders. In this
setting, an LLM can be as good as or possibly better than a human annotator
while being much faster, considerably cheaper, and much easier to scale to
large amounts of text. We also provide a comparison of GPT and popular
open-source LLMs, discussing the trade-offs in the model's choice. Our software
allows LLMs to be easily used as annotators and is publicly available:
https://github.com/lorelupo/pappa.
♻ ☆ HC3 Plus: A Semantic-Invariant Human ChatGPT Comparison Corpus CIKM2023
ChatGPT has garnered significant interest due to its impressive performance;
however, there is growing concern about its potential risks, particularly in
the detection of AI-generated content (AIGC), which is often challenging for
untrained individuals to identify. Current datasets used for detecting
ChatGPT-generated text primarily focus on question-answering tasks, often
overlooking tasks with semantic-invariant properties, such as summarization,
translation, and paraphrasing. In this paper, we demonstrate that detecting
model-generated text in semantic-invariant tasks is more challenging. To
address this gap, we introduce a more extensive and comprehensive dataset that
incorporates a wider range of tasks than previous work, including those with
semantic-invariant properties.
comment: This paper has been accepted by CIKM2023 workshop
♻ ☆ From Complexity to Clarity: How AI Enhances Perceptions of Scientists and the Public's Understanding of Science
This paper evaluated the effectiveness of using generative AI to simplify
science communication and enhance the public's understanding of science. By
comparing lay summaries of journal articles from PNAS, yoked to those generated
by AI, this work first assessed differences in linguistic simplicity across
such summaries, and then public perceptions in follow-up experiments.
Specifically, Study
1a analyzed simplicity features of PNAS abstracts (scientific summaries) and
significance statements (lay summaries), observing that lay summaries were
indeed linguistically simpler, but effect size differences were small. Study 1b
used a large language model, GPT-4, to create significance statements based on
paper abstracts and this more than doubled the average effect size without
fine-tuning. Study 2 experimentally demonstrated that simply-written GPT
summaries facilitated more favorable perceptions of scientists (they were
perceived as more credible and trustworthy, but less intelligent) than more
complexly-written human PNAS summaries. Crucially, Study 3 experimentally
demonstrated that participants comprehended scientific writing better after
reading simple GPT summaries compared to complex PNAS summaries. In their own
words, participants also summarized scientific papers in a more detailed and
concrete manner after reading GPT summaries compared to PNAS summaries of the
same article. AI has the potential to engage scientific communities and the
public via a simple language heuristic, advocating for its integration into
scientific dissemination for a more informed society.
comment: 17 pages
♻ ☆ RecurrentGemma: Moving Past Transformers for Efficient Open Language Models
Aleksandar Botev, Soham De, Samuel L Smith, Anushan Fernando, George-Cristian Muraru, Ruba Haroun, Leonard Berrada, Razvan Pascanu, Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot, Johan Ferret, Sertan Girgin, Olivier Bachem, Alek Andreev, Kathleen Kenealy, Thomas Mesnard, Cassidy Hardin, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Armand Joulin, Noah Fiedel, Evan Senter, Yutian Chen, Srivatsan Srinivasan, Guillaume Desjardins, David Budden, Arnaud Doucet, Sharad Vikram, Adam Paszke, Trevor Gale, Sebastian Borgeaud, Charlie Chen, Andy Brock, Antonia Paterson, Jenny Brennan, Meg Risdal, Raj Gundluru, Nesh Devanathan, Paul Mooney, Nilay Chauhan, Phil Culliton, Luiz Gustavo Martins, Elisa Bandy, David Huntsperger, Glenn Cameron, Arthur Zucker, Tris Warkentin, Ludovic Peran, Minh Giang, Zoubin Ghahramani, Clément Farabet, Koray Kavukcuoglu, Demis Hassabis, Raia Hadsell, Yee Whye Teh, Nando de Freitas
We introduce RecurrentGemma, a family of open language models which uses
Google's novel Griffin architecture. Griffin combines linear recurrences with
local attention to achieve excellent performance on language. It has a
fixed-sized state, which reduces memory use and enables efficient inference on
long sequences. We provide two sizes of models, containing 2B and 9B
parameters, and provide pre-trained and instruction tuned variants for both.
Our models achieve comparable performance to similarly-sized Gemma baselines
despite being trained on fewer tokens.
♻ ☆ A Statistical Framework of Watermarks for Large Language Models: Pivot, Detection Efficiency and Optimal Rules
Since ChatGPT was introduced in November 2022, embedding (nearly)
unnoticeable statistical signals into text generated by large language models
(LLMs), also known as watermarking, has been used as a principled approach to
provable detection of LLM-generated text from its human-written counterpart. In
this paper, we introduce a general and flexible framework for reasoning about
the statistical efficiency of watermarks and designing powerful detection
rules. Inspired by the hypothesis testing formulation of watermark detection,
our framework starts by selecting a pivotal statistic of the text and a secret
key -- provided by the LLM to the verifier -- to enable controlling the false
positive rate (the error of mistakenly detecting human-written text as
LLM-generated). Next, this framework allows one to evaluate the power of
watermark detection rules by obtaining a closed-form expression of the
asymptotic false negative rate (the error of incorrectly classifying
LLM-generated text as human-written). Our framework further reduces the problem
of determining the optimal detection rule to solving a minimax optimization
program. We apply this framework to two representative watermarks -- one of
which has been internally implemented at OpenAI -- and obtain several findings
that can be instrumental in guiding the practice of implementing watermarks. In
particular, we derive optimal detection rules for these watermarks under our
framework. These theoretically derived detection rules are demonstrated to be
competitive and sometimes enjoy a higher power than existing detection
approaches through numerical experiments.
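For a Gumbel-max-style watermark of the kind discussed above, a natural pivotal statistic per token is -log(1 - U), where U is a pseudo-uniform variate derived from the secret key; under the null hypothesis the sum over n tokens is Gamma(n, 1) distributed, which is what enables the false-positive control described in the abstract. The SHA-256 keying below is a toy assumption, not the deployed scheme:

```python
import hashlib
import math

def pivot_u(key, token):
    """Pseudo-uniform U in [0, 1) derived from the secret key and a token."""
    digest = hashlib.sha256(f"{key}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2.0 ** 64

def detect_watermark(key, tokens, threshold):
    """Reject H0 (human-written) when the summed pivotal statistic
    sum(-log(1 - U_t)) exceeds a Gamma(n, 1) quantile `threshold`."""
    stat = sum(-math.log(1.0 - pivot_u(key, t)) for t in tokens)
    return stat > threshold
```

Watermarked generation biases token choices toward large U values, inflating the statistic; human text leaves it at its null distribution, so the threshold trades off false positives against false negatives exactly as the framework formalizes.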
♻ ☆ Downstream bias mitigation is all you need
The advent of transformer-based architectures and large language models
(LLMs) has significantly advanced the performance of natural language
processing (NLP) models. Since these LLMs are trained on huge corpora of data
from the web and other sources, there has been a major concern about harmful
prejudices that may potentially be transferred from the data. In many
applications, these pre-trained LLMs are fine-tuned on task specific datasets,
which can further contribute to biases. This paper studies the extent of biases
absorbed by LLMs during pre-training as well as task-specific behaviour after
fine-tuning. We found that controlled interventions on pre-trained LLMs, prior
to fine-tuning, have minimal effect on lowering biases in classifiers. However,
the biases present in domain-specific datasets play a much bigger role, and
hence mitigating them at this stage has a bigger impact. While pre-training
does matter, once the model has been pre-trained, even slight changes to
co-occurrence rates in the fine-tuning dataset have a significant effect on the
bias of the model.
comment: arXiv admin note: This work has been withdrawn by arXiv
administrators due to inappropriate text reuse from external sources
♻ ☆ Look Before You Leap: Towards Decision-Aware and Generalizable Tool-Usage for Large Language Models
Tool-augmented large language models (LLMs) are attracting widespread
attention for their ability to access up-to-date knowledge and alleviate
hallucination issues. Nowadays, advanced closed-source LLMs (e.g., ChatGPT)
have demonstrated
surprising tool-usage capabilities through prompting and in-context learning
techniques. To empower the capabilities of open-source LLMs (e.g., LLaMA) in
manipulating tools, current efforts focus on either template-driven or
token-triggered tool-usage. However, the former hampers LLMs' flexibility to
address diverse user queries due to constrained tool interactions, while the
latter limits the generalizability when engaging with new tools, since
tool-usage learning is based on task- and tool-specific datasets. To alleviate
these concerns, in this paper, we propose a decision-aware and generalizable
tool-usage framework (DEER). Specifically, we first construct the tool-usage
samples with multiple decision branches via an automatic generation pipeline,
thereby inspiring the decision-making awareness of LLMs under diverse
scenarios. Meanwhile, we propose a novel tool sampling strategy to enhance the
generalizability of LLMs over unseen tools. Extensive experiments demonstrate
that our proposed DEER is effective and significantly outperforms baselines
across various datasets.
comment: 20 pages, 18 figures
♻ ☆ eRST: A Signaled Graph Theory of Discourse Relations and Organization
In this article we present Enhanced Rhetorical Structure Theory (eRST), a new
theoretical framework for computational discourse analysis, based on an
expansion of Rhetorical Structure Theory (RST). The framework encompasses
discourse relation graphs with tree-breaking, non-projective and concurrent
relations, as well as implicit and explicit signals which give explainable
rationales to our analyses. We survey shortcomings of RST and other existing
frameworks, such as Segmented Discourse Representation Theory (SDRT), the Penn
Discourse Treebank (PDTB) and Discourse Dependencies, and address these using
constructs in the proposed theory. We provide annotation, search and
visualization tools for data, and present and evaluate a freely available
corpus of English annotated according to our framework, encompassing 12 spoken
and written genres with over 200K tokens. Finally, we discuss automatic
parsing, evaluation metrics and applications for data in our framework.
♻ ☆ Unveiling the Statistical Foundations of Chain-of-Thought Prompting Methods
Chain-of-Thought (CoT) prompting and its variants have gained popularity as
effective methods for solving multi-step reasoning problems using pretrained
large language models (LLMs). In this work, we analyze CoT prompting from a
statistical estimation perspective, providing a comprehensive characterization
of its sample complexity. To this end, we introduce a multi-step latent
variable model that encapsulates the reasoning process, where the latent
variable encodes the task information. Under this framework, we demonstrate
that when the pretraining dataset is sufficiently large, the estimator formed
by CoT prompting is equivalent to a Bayesian estimator. This estimator
effectively solves the multi-step reasoning problem by aggregating a posterior
distribution inferred from the demonstration examples in the prompt. Moreover,
we prove that the statistical error of the CoT estimator can be decomposed into
two main components: (i) a prompting error, which arises from inferring the
true task using CoT prompts, and (ii) the statistical error of the pretrained
LLM. We establish that, under appropriate assumptions, the prompting error
decays exponentially to zero as the number of demonstrations increases.
Additionally, we explicitly characterize the approximation and generalization
errors of the pretrained LLM. Notably, we construct a transformer model that
approximates the target distribution of the multi-step reasoning problem with
an error that decreases exponentially in the number of transformer blocks. Our
analysis extends to other variants of CoT, including Self-Consistent CoT,
Tree-of-Thought, and Selection-Inference, offering a broad perspective on the
efficacy of these methods. We also provide numerical experiments to validate
the theoretical findings.
comment: 150 pages, 18 figures, 3 tables
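Schematically, the error decomposition described above can be written as follows; the notation is illustrative (the paper's formal statement uses its own constants and assumptions):

```latex
\underbrace{\mathrm{Err}\bigl(\widehat{\theta}_{\mathrm{CoT}}\bigr)}_{\text{total error}}
\;\lesssim\;
\underbrace{C_1\, e^{-c\, n}}_{\substack{\text{prompting error,}\\ n \,=\, \#\text{demonstrations}}}
\;+\;
\underbrace{\varepsilon_{\mathrm{LLM}}}_{\substack{\text{statistical error of}\\ \text{the pretrained LLM}}}
```

Here the first term decays exponentially in the number of in-prompt demonstrations, while the second is governed by the approximation and generalization errors of the pretrained model, with the approximation error shrinking exponentially in the number of transformer blocks.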
♻ ☆ Stick to your Role! Stability of Personal Values Expressed in Large Language Models
The standard way to study Large Language Models (LLMs) with benchmarks or
psychology questionnaires is to provide many different queries from similar
minimal contexts (e.g. multiple choice questions). However, due to LLMs' highly
context-dependent nature, conclusions from such minimal-context evaluations may
say little about the model's behavior in deployment (where it will
be exposed to many new contexts). We argue that context-dependence
(specifically, value stability) should be studied as a specific property of
LLMs and used as another dimension of LLM comparison (alongside others such as
cognitive abilities, knowledge, or model size). We present a case-study on the
stability of value expression over different contexts (simulated conversations
on different topics) as measured using a standard psychology questionnaire
(PVQ) and on behavioral downstream tasks. Reusing methods from psychology, we
study Rank-order stability on the population (interpersonal) level, and
Ipsative stability on the individual (intrapersonal) level. We consider two
settings (with and without instructing LLMs to simulate particular personas),
two simulated populations, and three downstream tasks. We observe consistent
trends in the stability of models and model families - Mixtral, Mistral,
GPT-3.5 and Qwen families are more stable than LLaMa-2 and Phi. The consistency
of these trends implies that some models exhibit higher value stability than
others, and that stability can be estimated with the set of introduced
methodological tools. When instructed to simulate particular personas, LLMs
exhibit low Rank-order stability, which further diminishes with conversation
length. This highlights the need for future research on LLMs that coherently
simulate different personas. This paper provides a foundational step in that
direction, and, to our knowledge, it is the first study of value stability in
LLMs.
comment: The project website and code are available at
https://sites.google.com/view/llmvaluestability Published in PLOS ONE (
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0309114 ),
and a shorter version at CogSci 24 (
https://escholarship.org/uc/item/7w4823c6 )
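Rank-order stability, as borrowed from psychology, can be estimated with a Spearman rank correlation between personas' value scores measured in two different contexts. The sketch below uses made-up scores and a hand-rolled Spearman implementation purely for illustration; the paper's actual pipeline derives scores from the PVQ questionnaire over simulated conversations.

```python
def rank(xs):
    """Assign average ranks (1-based), handling ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    ra, rb = rank(a), rank(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra) ** 0.5
    vb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (va * vb)

# Hypothetical value scores for three personas in two simulated contexts.
scores_ctx1 = {"persona_a": 4.1, "persona_b": 2.3, "persona_c": 5.0}
scores_ctx2 = {"persona_a": 3.8, "persona_b": 2.9, "persona_c": 4.7}
personas = sorted(scores_ctx1)
stability = spearman([scores_ctx1[p] for p in personas],
                     [scores_ctx2[p] for p in personas])
```

A stability near 1 means the population's ordering of personas by a given value is preserved across contexts; Ipsative stability would instead correlate one persona's profile of values with itself across contexts.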
♻ ☆ Evaluating Large Language Models on Spatial Tasks: A Multi-Task Benchmarking Study
Liuchang Xu, Shuo Zhao, Qingming Lin, Luyao Chen, Qianqian Luo, Sensen Wu, Xinyue Ye, Hailin Feng, Zhenhong Du
The advent of large language models such as ChatGPT, Gemini, and others has
underscored the importance of evaluating their diverse capabilities, ranging
from natural language understanding to code generation. However, their
performance on spatial tasks has not been comprehensively assessed. This study
addresses this gap by introducing a novel multi-task spatial evaluation
dataset, designed to systematically explore and compare the performance of
several advanced models on spatial tasks. The dataset encompasses twelve
distinct task types, including spatial understanding and path planning, each
with verified, accurate answers. We evaluated multiple models, including
OpenAI's gpt-3.5-turbo, gpt-4o, and ZhipuAI's glm-4, through a two-phase
testing approach. Initially, we conducted zero-shot testing, followed by
categorizing the dataset by difficulty and performing prompt tuning tests.
Results indicate that gpt-4o achieved the highest overall accuracy in the first
phase, with an average of 71.3%. Although moonshot-v1-8k slightly
underperformed overall, it surpassed gpt-4o in place name recognition tasks.
The study also highlights the impact of prompt strategies on model performance
in specific tasks. For example, the Chain-of-Thought (COT) strategy increased
gpt-4o's accuracy in path planning from 12.4% to 87.5%, while a one-shot
strategy enhanced moonshot-v1-8k's accuracy in mapping tasks from 10.1% to
76.3%.
♻ ☆ Language-specific Calibration for Pruning Multilingual Language Models
Recent advances in large language model (LLM) pruning have shown
state-of-the-art compression results in post-training and retraining-free
settings while maintaining high predictive performance. However, such research
mainly considers calibrating pruning using English text, despite the
multilingual nature of modern LLMs and their frequent use in non-English
languages. In this paper, we set out to explore effective strategies for
calibrating the pruning of multilingual language models. We present the first
comprehensive empirical study, comparing different calibration languages for
pruning multilingual models across diverse tasks, models, and state-of-the-art
pruning techniques. Our results present practical suggestions, for example,
calibrating in the target language can efficiently yield lower perplexity, but
does not necessarily benefit downstream tasks. Further analysis reveals
that calibration in the target language mainly contributes to preserving
language-specific features related to fluency and coherence, but might not
contribute to capturing language-agnostic features such as language
understanding and reasoning. Last, we provide practical recommendations for
future practitioners.
♻ ☆ Evading AI-Generated Content Detectors using Homoglyphs
The advent of large language models (LLMs) has enabled the generation of text
that increasingly exhibits human-like characteristics. As the detection of such
content is of significant importance, numerous studies have been conducted with
the aim of developing reliable AI-generated text detectors. These detectors
have demonstrated promising results on test data, but recent research has
revealed that they can be circumvented by employing different techniques. In
this paper, we present homoglyph-based attacks ($a \rightarrow {\alpha}$) as a
means of circumventing existing detectors. A comprehensive evaluation was
conducted to assess the effectiveness of these attacks on seven detectors,
including ArguGPT, Binoculars, DetectGPT, Fast-DetectGPT, Ghostbuster, OpenAI's
detector, and watermarking techniques, on five different datasets. Our findings
demonstrate that homoglyph-based attacks can effectively circumvent
state-of-the-art detectors, leading them to classify all texts as either
AI-generated or human-written (decreasing the average Matthews Correlation
Coefficient from 0.64 to -0.01). We then examine the effectiveness of these
attacks by analyzing how homoglyphs impact different families of detectors.
Finally, we discuss the implications of these findings and potential defenses
against such attacks.
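The attack itself is simple to sketch: replace selected Latin characters with visually near-identical Unicode codepoints, so the rendered text looks unchanged while the byte and token sequence a detector sees changes completely. The mapping below is a small illustrative subset, not the full confusables table used in the paper's evaluation.

```python
# Latin characters mapped to visually similar Unicode homoglyphs
# (an illustrative subset; real attacks draw on larger confusable tables).
HOMOGLYPHS = {
    "a": "\u03b1",  # Greek small alpha
    "o": "\u03bf",  # Greek small omicron
    "e": "\u0435",  # Cyrillic small ie
    "c": "\u0441",  # Cyrillic small es
}

def homoglyph_attack(text: str, chars=("a", "e")) -> str:
    """Substitute selected characters with homoglyphs, altering the
    token sequence a detector sees while the text looks unchanged."""
    return "".join(HOMOGLYPHS.get(ch, ch) if ch in chars else ch
                   for ch in text)

attacked = homoglyph_attack("a language model wrote this sentence")
```

Because detectors, tokenizers, and watermark hash functions all operate on codepoints rather than glyph appearance, even a sparse substitution like this can shift perplexity-based scores and break watermark key alignment.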
♻ ☆ Deciphering the Impact of Pretraining Data on Large Language Models through Machine Unlearning ACL 2024
Through pretraining on a corpus with various sources, Large Language Models
(LLMs) have gained impressive performance. However, the impact of each
component of the pretraining corpus remains opaque. As a result, the
organization of the pretraining corpus is still empirical and may deviate from
the optimal. To address this issue, we systematically analyze the impact of 48
datasets from 5 major categories of pretraining data of LLMs and measure their
impacts on LLMs using benchmarks covering nine major categories of model
capabilities. Our analyses provide empirical results about the contribution of
multiple corpora on the performances of LLMs, along with their joint impact
patterns, including complementary, orthogonal, and correlational relationships.
We also identify a set of ``high-impact data'' such as Books that is
significantly related to a set of model capabilities. These findings provide
insights into the organization of data to support more efficient pretraining of
LLMs.
comment: Accepted by ACL 2024 Findings
♻ ☆ PASH at TREC 2021 Deep Learning Track: Generative Enhanced Model for Multi-stage Ranking
Yixuan Qiao, Hao Chen, Jun Wang, Tuozhen Liu, Xianbin Ye, Xin Tang, Rui Fang, Peng Gao, Wenfeng Xie, Guotong Xie
This paper describes the PASH participation in TREC 2021 Deep Learning Track.
In the recall stage, we adopt a scheme combining sparse and dense retrieval
methods. In the multi-stage ranking phase, point-wise and pair-wise ranking
strategies are applied in sequence, based on a model continually pre-trained on
general knowledge and document-level data. Compared to TREC 2020 Deep Learning
Track, we have additionally introduced the generative model T5 to further
enhance the performance.
comment: TREC 2021
♻ ☆ Large Language Model Sentinel: LLM Agent for Adversarial Purification
Over the past two years, the use of large language models (LLMs) has advanced
rapidly. While these LLMs offer considerable convenience, they also raise
security concerns, as LLMs are vulnerable to adversarial attacks via
well-designed textual perturbations. In this paper, we introduce a novel
defense technique named Large LAnguage MOdel Sentinel (LLAMOS), which is
designed to enhance the adversarial robustness of LLMs by purifying the
adversarial textual examples before feeding them into the target LLM. Our
method comprises two main components: a) Agent instruction, which can simulate
a new agent for adversarial defense, altering minimal characters to maintain
the original meaning of the sentence while defending against attacks; b)
Defense guidance, which provides strategies for modifying clean or adversarial
examples to ensure effective defense and accurate outputs from the target LLMs.
Remarkably, the defense agent demonstrates robust defensive capabilities even
without learning from adversarial examples. Additionally, we conduct an
intriguing adversarial experiment where we develop two agents, one for defense
and one for attack, and engage them in mutual confrontation. During the
adversarial interactions, neither agent completely beat the other. Extensive
experiments on both open-source and closed-source LLMs demonstrate that our
method effectively defends against adversarial attacks, thereby enhancing
adversarial robustness.
♻ ☆ AI-native Memory: A Pathway from LLMs Towards AGI
Large language models (LLMs) have shown the world sparks of
artificial general intelligence (AGI). One opinion, especially from some
startups working on LLMs, argues that an LLM with nearly unlimited context
length can realize AGI. However, they might be too optimistic about the
long-context capability of (existing) LLMs -- (1) Recent literature has shown
that their effective context length is significantly smaller than their claimed
context length; and (2) Our reasoning-in-a-haystack experiments further
demonstrate that simultaneously finding the relevant information from a long
context and conducting (simple) reasoning is nearly impossible. In this paper,
we envision a pathway from LLMs to AGI through the integration of
\emph{memory}. We believe that AGI should be a system where LLMs serve as core
processors. In addition to raw data, the memory in this system would store a
large number of important conclusions derived from reasoning processes.
Compared with retrieval-augmented generation (RAG), which merely processes raw
data, this approach not only brings semantically related information closer
together, but also simplifies complex inferences at query time. As an
intermediate stage, the memory will likely be in the form of natural language
descriptions, which can be directly consumed by users too. Ultimately, every
agent/person should have its own large personal model, a deep neural network
model (thus \emph{AI-native}) that parameterizes and compresses all types of
memory, even those that cannot be described in natural language. Finally, we
discuss the significant potential of AI-native memory as the transformative
infrastructure for (proactive) engagement, personalization, distribution, and
social interaction in the AGI era, as well as the attendant privacy and
security challenges, with preliminary solutions.
♻ ☆ SkyScript-100M: 1,000,000,000 Pairs of Scripts and Shooting Scripts for Short Drama
Jing Tang, Quanlu Jia, Yuqiang Xie, Zeyu Gong, Xiang Wen, Jiayi Zhang, Yalong Guo, Guibin Chen, Jiangping Yang
Generating high-quality shooting scripts containing information such as scene
and shot language is essential for short drama script generation. We collect
6,660 popular short dramas from the Internet, each with an average of 100 short
episodes, for a total of about 80,000 episodes, about 2,000 hours of footage,
and roughly 10 terabytes (TB) of data. We
perform keyframe extraction and annotation on each episode to obtain about
10,000,000 shooting scripts. We perform 100 script restorations on the
extracted shooting scripts based on our self-developed large short drama
generation model SkyReels. This leads to a dataset containing 1,000,000,000
pairs of scripts and shooting scripts for short dramas, called SkyScript-100M.
We compare SkyScript-100M with the existing dataset in detail and demonstrate
some deeper insights that can be achieved based on SkyScript-100M. Based on
SkyScript-100M, researchers can achieve several deeper and more far-reaching
script optimization goals, which may drive a paradigm shift in the entire field
of text-to-video and significantly advance the field of short drama video
generation. The data and code are available at
https://github.com/vaew/SkyScript-100M.
comment: 18 pages, 12 figures
♻ ☆ SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models
Dongchao Yang, Rongjie Huang, Yuanyuan Wang, Haohan Guo, Dading Chong, Songxiang Liu, Xixin Wu, Helen Meng
Scaling Text-to-speech (TTS) to large-scale datasets has been demonstrated as
an effective method for improving the diversity and naturalness of synthesized
speech. At the high level, previous large-scale TTS models can be categorized
into either Auto-regressive (AR) based (\textit{e.g.}, VALL-E) or
Non-auto-regressive (NAR) based models (\textit{e.g.}, NaturalSpeech 2/3).
Although these works demonstrate good performance, they still have potential
weaknesses. For instance, AR-based models are plagued by unstable generation
quality and slow generation speed; meanwhile, some NAR-based models need
phoneme-level duration alignment information, thereby increasing the complexity
of data pre-processing, model design, and loss design. In this work, we build
upon our previous publication by implementing a simple and efficient
non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2. SimpleSpeech 2
effectively combines the strengths of both autoregressive (AR) and
non-autoregressive (NAR) methods, offering the following key advantages: (1)
simplified data preparation; (2) straightforward model and loss design; and (3)
stable, high-quality generation performance with fast inference speed. Compared
to our previous publication, we present ({\romannumeral1}) a detailed analysis
of the influence of speech tokenizer and noisy label for TTS performance;
({\romannumeral2}) four distinct types of sentence duration predictors;
({\romannumeral3}) a novel flow-based scalar latent transformer diffusion
model. With these improvements, we show significant gains in generation
performance and generation speed compared to our previous work and other
state-of-the-art (SOTA) large-scale TTS models. Furthermore, we show that
SimpleSpeech 2 can be seamlessly extended to multilingual TTS by training it on
multilingual speech datasets. Demos are available on:
{https://dongchaoyang.top/SimpleSpeech2\_demo/}.
comment: Submit to TASLP
♻ ☆ xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo, Shrikant Kendre, Jieyu Zhang, Can Qin, Shu Zhang, Chia-Chih Chen, Ning Yu, Juntao Tan, Tulika Manoj Awalgaonkar, Shelby Heinecke, Huan Wang, Yejin Choi, Ludwig Schmidt, Zeyuan Chen, Silvio Savarese, Juan Carlos Niebles, Caiming Xiong, Ran Xu
This report introduces xGen-MM (also known as BLIP-3), a framework for
developing Large Multimodal Models (LMMs). The framework comprises meticulously
curated datasets, a training recipe, model architectures, and a resulting suite
of LMMs. xGen-MM, short for xGen-MultiModal, expands the Salesforce xGen
initiative on foundation AI models. Our models undergo rigorous evaluation
across a range of tasks, including both single and multi-image benchmarks. Our
pre-trained base model exhibits strong in-context learning capabilities and the
instruction-tuned model demonstrates competitive performance among open-source
LMMs with similar model sizes. In addition, we introduce a safety-tuned model
with DPO, aiming to mitigate harmful behaviors such as hallucinations and
improve safety. We open-source our models, curated large-scale datasets, and
our fine-tuning codebase to facilitate further advancements in LMM research.
Associated resources will be available on our project page above.
♻ ☆ A Survey of Large Language Models for European Languages
Large Language Models (LLMs) have gained significant attention due to their
high performance on a wide range of natural language tasks since the release of
ChatGPT. The LLMs learn to understand and generate language by training
billions of model parameters on vast volumes of text data. Despite being a
relatively new field, LLM research is rapidly advancing in various directions.
In this paper, we present an overview of LLM families, including LLaMA, PaLM,
GPT, and MoE, and the methods developed to create and enhance LLMs for official
European Union (EU) languages. We provide a comprehensive summary of common
monolingual and multilingual datasets used for pretraining large language
models.
♻ ☆ WeKnow-RAG: An Adaptive Approach for Retrieval-Augmented Generation Integrating Web Search and Knowledge Graphs KDD
Large Language Models (LLMs) have greatly contributed to the development of
adaptive intelligent agents and are positioned as an important way to achieve
Artificial General Intelligence (AGI). However, LLMs are prone to producing
factually incorrect information and often generate "phantom" (hallucinated)
content that undermines their reliability, which poses a serious challenge for their
deployment in real-world scenarios. Enhancing LLMs by combining external
databases and information retrieval mechanisms is an effective path. To address
the above challenges, we propose a new approach called WeKnow-RAG, which
integrates Web search and Knowledge Graphs into a "Retrieval-Augmented
Generation (RAG)" system. First, the accuracy and reliability of LLM responses
are improved by combining the structured representation of Knowledge Graphs
with the flexibility of dense vector retrieval. WeKnow-RAG then utilizes
domain-specific knowledge graphs to satisfy a variety of queries and domains,
thereby improving performance on factual information and complex reasoning
tasks by employing multi-stage web page retrieval techniques using both sparse
and dense retrieval methods. Our approach effectively balances the efficiency
and accuracy of information retrieval, thus improving the overall retrieval
process. Finally, we also integrate a self-assessment mechanism for the LLM to
evaluate the trustworthiness of the answers it generates. Our approach proves
its outstanding effectiveness in a wide range of offline experiments and online
submissions.
comment: 8 pages, 2 figures, technical report for 3rd place in Task 3 of Meta
KDD Cup 2024 CRAG Challenge
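One standard way to combine sparse and dense retrieval results, sketched below, is reciprocal rank fusion (RRF). Note this is a generic technique chosen for illustration — the abstract does not specify WeKnow-RAG's exact fusion rule — and the document IDs are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists (e.g., a sparse BM25 ranking and a
    dense embedding ranking) into one; k dampens the influence of
    lower-ranked documents."""
    scores = {}
    for ranking in rankings:
        for rank_pos, doc_id in enumerate(ranking):
            # 1-based rank; each list contributes 1 / (k + rank).
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank_pos + 1)
    return sorted(scores, key=scores.get, reverse=True)

sparse = ["d3", "d1", "d7"]   # hypothetical BM25 ranking
dense = ["d3", "d1", "d9"]    # hypothetical dense-retrieval ranking
fused = reciprocal_rank_fusion([sparse, dense])
```

RRF needs no score calibration between the two retrievers, which is why it is a common default for balancing sparse and dense signals before downstream reranking or KG-based answer grounding.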
♻ ☆ Large Language Models Understand Layout ECAI-2024
Large language models (LLMs) demonstrate extraordinary abilities in a wide
range of natural language processing (NLP) tasks. In this paper, we show that,
beyond text understanding capability, LLMs are capable of processing text
layouts that are denoted by spatial markers. They are able to answer questions
that require explicit spatial perception and reasoning, while a drastic
performance drop is observed when the spatial markers from the original data
are excluded. We perform a series of experiments with the GPT-3.5, Baichuan2,
Llama2 and ChatGLM3 models on various types of layout-sensitive datasets for
further analysis. The experimental results reveal that the layout understanding
ability of LLMs is mainly introduced by the coding data for pretraining, which
is further enhanced at the instruction-tuning stage. In addition, layout
understanding can be enhanced by integrating low-cost, auto-generated data
produced through a novel text game. Finally, we show that layout understanding
ability is beneficial for building efficient visual question-answering (VQA)
systems.
comment: This paper has been accepted by ECAI-2024
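A toy example of what "spatial markers" means in practice: the whitespace below encodes a two-column table, and collapsing it (as happens when markers are excluded) destroys the alignment a model would need to answer layout questions. The table contents are illustrative, not drawn from the paper's datasets.

```python
# A small table whose meaning depends on whitespace-based layout.
layout_text = (
    "Name      Price\n"
    "Apple     $1.20\n"
    "Banana    $0.50\n"
)

# With spatial markers (spaces and newlines) preserved, rows and columns
# stay aligned; collapsing all whitespace flattens the structure that
# layout-aware prompting relies on.
collapsed = " ".join(layout_text.split())
```

A question such as "what is the price of Banana?" is trivial against `layout_text` but ambiguous against `collapsed`, which mirrors the performance drop the paper observes when spatial markers are removed.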
♻ ☆ VHAKG: A Multi-modal Knowledge Graph Based on Synchronized Multi-view Videos of Daily Activities CIKM2024
Multi-modal knowledge graphs (MMKGs), which ground various non-symbolic data
(e.g., images and videos) into symbols, have attracted attention as resources
enabling knowledge processing and machine learning across modalities. However,
the construction of MMKGs for videos consisting of multiple events, such as
daily activities, is still in the early stages. In this paper, we construct an
MMKG based on synchronized multi-view simulated videos of daily activities.
Besides representing the content of daily life videos as event-centric
knowledge, our MMKG also includes frame-by-frame fine-grained changes, such as
bounding boxes within video frames. In addition, we provide support tools for
querying our MMKG. As an application example, we demonstrate that our MMKG
facilitates benchmarking vision-language models by providing the necessary
vision-language datasets for a tailored task.
comment: 5 pages, 4 figures, accepted by CIKM2024 Resource Track